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Abstract 

Background: Protein remote homology detection is one of the central problems in bioinformatics, which is 
important for both basic research and practical application. Currently, discriminative methods based on Support 
Vector Machines (SVMs) achieve the state-of-the-art performance. Exploring feature vectors incorporating the 
position information of amino acids or other protein building blocks is a key step to improve the performance of 
the SVM-based methods. 

Results: Two new methods for protein remote homology detection were proposed, called SVM-DR and SVM-DT 
SVM-DR is a sequence-based method, in which the feature vector representation for protein is based on the 
distances between residue pairs. SVM-DT is a profile-based method, which considers the distances between Top-n- 
gram pairs. Top-n-gram can be viewed as a profile-based building block of proteins, which is calculated from the 
frequency profiles. These two methods are position dependent approaches incorporating the sequence-order 
information of protein sequences. Various experiments were conducted on a benchmark dataset containing 54 
families and 23 superfamilies. Experimental results showed that these two new methods are very promising. 
Compared with the position independent methods, the performance improvement is obvious. Furthermore, the 
proposed methods can also provide useful insights for studying the features of protein families. 

Conclusion: The better performance of the proposed methods demonstrates that the position dependant 
approaches are efficient for protein remote homology detection. Another advantage of our methods arises from 
the explicit feature space representation, which can be used to analyze the characteristic features of protein 
families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/ 
index.jsp 



Background 

Protein remote homology detection is a central problem in 
computation biology, which refers to the detection of evo- 
lutional homology in proteins with low similarities. 
Through evolution, structure is more conserved than 
sequence. Thus, knowledge of protein structure and evolu- 
tion is important for predicting the functions of proteins, 
which will promote protein binding study, rational drug 
design, and many other related fields and applications. 
Protein remote homology detection identifies proteins 
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from different families, and therefore can be applied to 
predict structure and function of specific proteins. 

Unfortunately, protein remote homology detection is still 
a changing problem in bioinformatics and therefore accu- 
rately and efficiently computational approaches are needed. 
During the past two decades, some computational methods 
have been proposed for protein remote homology detec- 
tion, which can be mainly divided into two major cate- 
gories: generative methods and discriminative algorithms. 
Early solutions of protein remote homology detection were 
generative methods, which trained a model to represent a 
protein family and then evaluated a query sequence accord- 
ing to this model. For example, BLAST [1], PSI-BLAST [2], 
and Hidden Markov Model (HMM) [3] searched the 
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protein database based on a model built by both positively 
labeled and unlabeled proteins. Generative methods didn't 
perform well because they only make use of positive train- 
ing samples to build the models for the prediction. Some 
generative methods have been improved by developing 
more sensitive profiles, for example, HHsearch method [4] 
used the hidden Markov model to calculate a novel profile. 
COMPASS [5] generated numerical profiles and con- 
structed optimal profile-profile alignments. FFAS [6] was 
based on a new procedure for profile generation that takes 
into account all the relations within the family. Some online 
servers are available, including FORTE [7], RANKPOOP 
[8], webPRC [9], PHYRE [10], GenThreader [11], COMA 
[12], and, Bioshell [13]. 

The discriminative approaches mark protein sequences 
with a set of labels, positive if they are in the protein 
family and negative otherwise. These methods attempt to 
learn the distinction between different classes. Both posi- 
tive and negative samples are used in training for these 
approaches. The most popular discriminative methods 
for remote homology detection problem are based on the 
Support Vector Machine (SVM) [14]. SVM learns a linear 
decision boundary from both positive and negative train- 
ing samples to discriminate between the unseen protein 
sequences. A key feature of SVM is that it needs fixed 
length input vector. Thus some researchers have intro- 
duced different types of kernel functions and feature vec- 
tors for protein representation. Most of these kernel 
functions were based on sequence composition and pro- 
files. For features based on sequence composition, some 
methods were based on the similarity between a protein 
sequence and other sequences in the training sets. For 
example, Fisher kernel [15] and SVM-Pairwise [16] mea- 
sured the similarity from the local alignment between 
proteins, but the alignment score fallen into a twilight 
zone when the protein sequence similarity is below 35% 
at the amino acid level [17]. Later, these methods were 
improved by introducing several kernels, such as LA ker- 
nel [18], SVM-HUSTLE [19]. However these methods 
ignored the information from the protein structure and 
evolutionary information, which led to limited perfor- 
mance improvement. Some kernels were based on 
sequence features, whose feature vector were calculated 
from the subsequences, incorporating the protein struc- 
ture information or amino acid position information. For 
instance, in Motif kernel [20], a protein sequence was 
represented in a vector space indexed by a set of motifs 
over a alphabet and subsequences. Spectrum Kernel [21] 
searched all possible subsequences of length k from a 
alphabet to form a feature map. SVM-I-sites [22] 
encoded structure information into the feature vectors. 
Mismatch kernel [23] was calculated based on shared 
occurrences of (k, w)-patterns in the data. LSA [24] 
improved the performance of building-block-based 



methods. SVM-N-Peptide [25] reduced the size of amino 
acid alphabet for increasing values of k. The performance 
of the sequence-based methods is not satisfying because 
these methods only use the sequence features without 
using the evolutionary information or 3-dimension struc- 
ture. In profile-based methods, the feature vectors were 
extracted from profiles which contain the evolutional 
information. These methods showed superior perfor- 
mance than the sequence-based methods. This is because 
that a profile is much richer than an individual protein 
sequence in encoding information. Protein evolution 
involves changes of single residues, insertions and dele- 
tions of several residues, gene doubling, and gene fusion. 
With these changes accumulated for a long period of 
time, many similarities between initial and resultant pro- 
tein sequences are gradually eliminated, but the corre- 
sponding proteins may still share many common 
features, such as similar structure and function. Profile 
describes this kind of evolutionary information, and 
therefore the profile-based kernels outperform the 
sequence-based kernels for protein remote homology 
detection. For instance, SW-PSSM [26] introduced two 
classes of kernel functions which were constructed from 
protein similarity measures by employing effective pro- 
file-to-profile scoring schemes. Profile kernel [27] used 
probabilistic profiles to define position-dependent muta- 
tion neighborhoods along protein sequences. A Top-n- 
gram-based approach [28] was proposed for protein 
remote homology detection. Top-n-gram can be viewed 
as a profile-based building block of proteins obtained by 
combining the most frequent amino acids in the profiles. 
The proteins were converted into fixed length vectors by 
the occurrences of each Top-n-gram and input into SVM 
for the prediction. Although, this method showed some 
improvements in predictive performance, this method 
completely ignores the sequence-order information. 
Recent studies showed that the sequence-order effects 
are relevant for protein remote homology detection. For 
example, SVM-PDT [29] incorporated the sequence- 
order information by considering the amino acid physi- 
cochemical properties of any two residues in a protein 
within a given distance. ODH [30] provided the basis dis- 
tance histograms for any possible pair of Aimers based on 
distances between short oligomers, which outperformed 
other position independent approaches. In ACC method 
[31], the sequence-order information was captured by the 
autocross-covariance (ACC) transformation. SVM- 
HMMSTR [32] can capture the sequential ordering of 
the local structures. SVM-RQA [33] used the recurrence 
quantification analysis (RQA) to detect the autocorrela- 
tion patterns along the protein sequences. 

Motivated by the successful of the position dependent 
methods, in this study, we extend the Top-n-gram- 
based method [28] by considering the sequence-order 
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information, which is measured by all the possible Top- 
n-gram pairs within a given distance. In this approach, 
first, each amino acids in a protein sequence are con- 
verted into Top-n-grams based on the frequency profiles 
calculated from multiple sequence alignment. Second, 
the feature vector is calculated by the occurrences of all 
the Top-n-gram pairs within a given distance threshold 
cutoff. Finally, this feature space is input into SVM for 
the prediction. 

Methods 

As shown by a series of publications [34-38], to develop a 
useful statistical prediction method or model for a biolo- 
gical system, one needs to engage the following proce- 
dures: (i) construct or select a valid benchmark dataset to 
train and test the predictor; (ii) formulate the samples 
with an effective mathematical expression that can truly 
reflect their intrinsic correlation with the target to be 
predicted; (iii) introduce or develop a powerful algorithm 
(or engine) to operate the prediction; (iv) properly per- 
form cross-validation tests to objectively evaluate the 
anticipated accuracy of the predictor; (v) provide the 
downloadable source code or a web-server for the predic- 
tion method. Below, let us describe how to deal these 
procedures. 

Dataset description 

SCOP 1.53 superfamily benchmark 

We used a common benchmark [39] to evaluate the per- 
formance of our methods for protein remote homology 
detection, which can be downloaded at http://noble.gs. 
washington.edu/proj/svm-pairwise/. This benchmark has 
been used by many studies, which can provide good com- 
parability with previous approaches [4,16,18,28-30,35,36]. 
There are 54 families and 4352 proteins selected from 
SCOP version 1.53. All protein sequences were extracted 
from the Astral database [40] and no pairwise alignments 
is higher than an E-value of 10" 25 . Proteins within one 
SCOP family were taken as positive test samples, and pro- 
teins outside the family but within the same superfamily 
were taken as positive training samples. Negative samples 
were selected from outside of the superfamily and were 
separated into training and test sets. 

Distance-based Top-n-gram (DT) and distance-based 
Residue (DR) 

In this study, two approaches were proposed to convert 
protein sequences into feature vectors, including Distance- 
based Top-n-gram approach (DT) and Distance-based 
Residue approach (DR). First of all, we will introduce the 
process of the Distance-based Top-n-gram approach. 

In previous study, a Top-n-gram-based method [28] 
was proposed for protein remote homology detection, 



which showed better predictive performance than some 
state-of-the-art methods, including SVM-LA [18], SVM- 
pairwise [16], and SVM-pattern [24]. Top-n-gram can 
be viewed as a profile-based building block of protein 
sequences, which contains the evolutionary information 
extracted from frequency profiles. Each amino acid in a 
protein sequence can be converted into a corresponding 
Top-n-gram by combining the top n most frequency 
amino acids in the corresponding column of a frequency 
profile, and the order of the amino acids in a Top-n- 
gram reflects the different importance of these amino 
acids during evolution. By replacing all the amino acids 
in a protein with their corresponding Top-n-grams, a 
protein sequence can be represented as a sequence of 
Top-n-grams instead of a sequence of amino acids. For 
more details of Top-n-gram, please refer to the study 
[28]. 

In order to incorporate the sequence-order informa- 
tion into the prediction, a Distance-based Top-n-gram 
(DT) approach was proposed, which extends the original 
Top-n-gram-based feature vector by considering the 
relative position information of Top-n-gram pairs in 
protein sequences. The proposed feature vector was cal- 
culated by counting the occurrences of all possible Top- 
n-gram pairs within a certain distance threshold. In this 
study, the Top- 1 -gram was selected to construct the 
Distance-based Top-n-gram feature vector in order to 
reduce the dimension of the feature vectors and reduce 
the computational cost. Therefore, we will introduce the 
proposed Distance-based Top-n-gram approach based 
on Top-l-gram. 

Given an alphabet of Top-l-grams A(A, R, D, C, Q, E, 
H, I, G, N, L, K, M, F, P, S, T, W, Y, V), we consider 
the distances between all Top-l-gram pairs in a Top-l- 
gram sequence S", which is capable of measuring the 
position information of the Top-l-grams sequence S'. 
Firstly, we define a distance d between Top-l-gram pair 
(t ; , tj), which means that Top-l-gram f, occurs before 
Top-l-gram tj and the distance between t t and tj is d. 
Given a distance threshold d M AX> we set the maximum 
distance between Top-l-gram pair (t,-, tj) as d MAX . Sec- 
ondly, we count the occurrences of these distances 
between all Top-l-gram pairs. Thus for any distance 
d < d M AX> we can build a distance-based feature vector 
of S according to: 

nK'l-J [TVS'), TVS') T° V (S')] (d=0) m 

1[TV(S'), TV(S') T^wCS')] (l<d<A«x) w 

Where T 0 ;(S') is the occurrences of Top-l-gram t t , 
which represents the importance of each Top-l-gram 
occurring in S"; T d 1} {S') is the occurrences of Top-l- 
gram pair (ti, tj). The feature vector of S" is achieved by 
combining all the Top-l-gram pairs at different 
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distances and the final feature vector can be represented 
as: 

F4«*(S') = [D 0 (S'),D 1 (S') B dMAX {S')] (2) 

The dimension of the feature vector is 20 + 20 * 20 * 
dMAx> where 20 is the size of the alphabet of Top-1- 
grams. 

In order to help the readers to further understand 
the process of converting a protein sequence into a 
feature vector, a specific example is given and shown 
in Figure 1. Given a protein sequence S = 'AGLP', each 
amino acid in S can be converted into a Top-l-gram, 
and therefore S can be represented as a sequence of 
Top-l-gram S" (KFFK). S' contains the evolutionary 
information extracted from frequency profile. If the 
distance threshold d MAX is set as 2, the occurrences of 
all Top-l-gram pairs at distance 0, 1, 2 are counted. 
Then the feature vector is obtained by combining the 
occurrences of Top-l-gram pairs at distance 0, 1, and 2. 
The algorithm of construing the Distance-based Top-l- 
gram feature vector is shown in Figure 2. The time com- 
plexity of this algorithm is 0(L 2 ), where L is the length 
of the longest protein in the dataset. The source code 
can be downloaded at http://bioinformatics.hitsz.edu. 
cn/DistanceS VM/ index.jsp 

The Distance-based Residue approach (DR) is similar 
as the Distance-based Top-l-gram approach (DT), 
except that the native protein sequence was directly 
converted into the feature vector without replacing the 
amino acids with Top-l-grams. 

Construction of SVM classifiers and classification 

SVM learns a linear decision boundary from both posi- 
tive and negative training samples to discriminate 
between the unseen protein sequences. A key feature of 
SVM is that it needs fixed length of the input vector. 
The proteins in the training set and test set were trans- 
formed into fixed-dimension feature vectors following 
the process introduced above, and then the training vec- 
tors were input into SVM to construct the classifier. 
The SVM gives a discriminative score for each protein 
in the test set, which indicates the class level of the pro- 
tein. In order to have better comparability with other 
SVM-based methods, we employed the publicly available 
Gist SVM package version 2.3 (http://www.chibi.ubc.ca/ 
gist/index. html). The SVM parameters were used by 
default of the Gist Package. 

Evaluation methodology 

In order to evaluate the performance of SVM-based 
methods applied in unbalanced dataset, we applied 
receiver operating characteristics (ROC) score and 
ROC50 score to measure the performance of different 



methods. The ROC score is the normalized area under a 
curve that plots true positives against false positives for 
different possible thresholds for classification and the 
ROC50 score is the area under the ROC curve up to the 
first 50 false positives. The discriminative score obtained 
from the SVM classifier can be used to calculate the 
ROC score and ROC50 score. 

Results and discussion 

The impact of d MAX value on the performance of SVM-DT 
and SVM-DR 

There is a parameter d MAX in the proposed methods 
(see method section for details), which would impact on 
the predictive performance of the proposed methods 
SVM-DT and SVM-DR. d max can be any integer 
between 0 and the length of the longest protein 
sequence in the dataset. Figure 3 descripts the average 
ROC scores of the two methods with different d max 
values. The performance of the two methods improves 
quickly with the increment of d max from 0 to 100, and 
the performance of both the two methods turns stable 
with the d max in the range of [100, 200]. These results 
reveal that the distance-based approaches are relevant 
for discrimination. In real world application, smaller 
d max is preferred because it leads to shorter feature vec- 
tors, and therefore less computational and space cost is 
needed. In this study, the d max was set as 150 consider- 
ing the trade-off between performance and computa- 
tional cost. 

Comparative results of previous approaches 

Nine state-of-the-art protein remote homology detection 
methods were selected to compare with the proposed 
methods, including Monomer-dist [30], SVM-Top-n-gram 
[28], SVM-Top-n-gram-LSA [28], SVM-PDT-Profile [29], 
PseAACIndex-Porfile [41], SVM-LA [18], SVM-Pairwise 
[16], BioSVM-2L [42], and HHSearch [4]. HHSearch is a 
generative method, and the other eight methods as well as 
the proposed SVM-DR and SVM-DT are discriminative 
methods based on SVM. They are different in the strate- 
gies of constructing the feature vectors and kernel func- 
tions. The feature vector of Monomer-dist was based on 
the distances between short oligomers. SVM-Top-n-gram 
constructed the feature vectors by the occurrences of 
Top-n-grams, which can be viewed as a profile-based 
"building block" of proteins. SVM-Top-n-gram-LSA 
further improved this method by employing the Latent 
Semantic Analysis (LSA). SVM-PDT-Profile combined the 
amino acid physicochemical properties in the Amino Acid 
Index (AAIndex) [43] with the frequency profiles for the 
prediction. PseAACIndex-Porfile combined the Pseudo 
Amino Acid Composition (PseAAC) proposed by Chou 
with profile-based protein representation to convert pro- 
teins into fixed length vectors. The kernel of SVM-LA 
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Figure 1 The process of generating Distance-based Top-1-gram feature vector. A protein 5 is input into the PSI-BLAST software to do the 
multiple sequence alignments against a non-redundant database, and then the frequency profile is calculated from the multiple sequence 
alignments. The frequencies of the 20 standard amino acids in each column of the frequency profile are sorted in descending order. Top-1-gram 
is the most frequent amino acid in each column of frequency profile. 5 can be represented as a sequence of Top-1-grms 5' by combining all the 
obtained Top-1 -grams according to their sequence order. Assuming that the distance threshold d MAX is set as 2, the feature vector is the 
combination of Top-1 -gram pairs at distance 0, 1, and 2. 
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Algorithm 1 Construct Feature Vector of Distant-based Top-l-grams 



Input: 

S': the sequence of Top-l-grams 
d\ux. distance threshold djatx 
Output: 

fealureVecror- Distant-based Feature Vector of 5 ' 

1 : for i = 0 to 20 + 20 *20 * diax do // Initialize 
2 : feature Vector [ i] = 0 
3: end 

4: for Position = 0 to S'. Length - 1 do //Calculate the occurrences of distance 0 
5: featureVecior[index(Top-l-gramAtPosition)] +- 1 
6: end 

7: for d = 1 toduAX&& d< ■? '. Length &<> 

8: for FirstPosition - 0 to S '.Length - 1 - d do 

9: SecondPosition - FirstPosition + d 

10: indexF- index(Top-l-gram-itFirstPosition) 

11: indexS - index(Top-l-gramAtSecoondPosition) 

12: featureVector[20 + (d- 1) * 20 : + indexF * 20 + indexSJ += 1 

13: end 

14: end 

Figure 2 Algorithm of construing the Distance-based Topo- 
gram feature vector. The input of this algorithm is the Top-1-gram 
sequence S', distance threshold d MAX , and the output is the feature 
vector of distance-based Top-1 -grams. The vector of alphabet Index 
0 is the index of all the Top-1 -gram in the alphabet A and 20 is the 
size of A for example, index 0 indicates the first Top-1-gram in the 
alphabet A(t, = A), and index 19 is the last Top-1 -gram in the 
alphabet A{t, 9 = V). 



measured the similarity between a pair of proteins by tak- 
ing into account all the optimal local alignment scores 
with gaps between all possible subsequences. BioSVM-2L 
constructed two-layer SVM classifiers with profile-based 
kernels. In SVM-Pairwise, each protein was represented as 
a vector of pairwise similarities to all proteins in the train- 
ing set. HHSearch is one of the best protein remote 
homology detection methods, which employed a novel 
profile-based hidden Markov models. 
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Figure 3 The average ROC scores of the SVM-DR and SVM-DT 
with different distance threshold values of d MAX . 



Table 1 shows the predictive results of the two pro- 
posed methods (SVM-DT, SVM-DR) and other nine 
related methods. Generally speaking, profile-based meth- 
ods are superior to sequence-based methods because 
they use the evolutionary information in profiles for 
protein remote homology detection. The proposed 
sequence-based method SVM-DR outperforms two 
sequence-based methods Monomer-dist, SVM-Pairwise, 
and is highly comparable with SVM-LA. For the profile- 
based methods, SVM-DT outperforms other methods 
except for SVM-PDT-Profile in terms of average ROC 
score. SVM-DT improves the SVM-Top-n-gram by con- 
sidering the Top-n-gram pairs at different distances. The 
experimental results demonstrate that this sequence- 
order information is relevant for discrimination and the 
average ROC and ROC50 scores can be improved by 
4.1% and 10.4%, respectively. 

Correlations between discriminative features and protein 
family 

According to the above results, the proposed Distance- 
based Top-n-gram (SVM-DT) method showed the best 
performance with low computational cost when the dis- 
tance threshold (Imax was taken as the value of 150. In 
order to further investigate the importance of the features 
and reveal the biological meaning of the feature space, we 
followed the study [29] to calculate the discriminant 
weight vector in the feature space. The sequence-specific 
weight obtained from the SVM training process can 
be used to calculate the discriminant weight of each fea- 
ture to measure the importance of the features. Given the 



Table 1 Average ROC and ROC50 scores over 54 families 
for different methods. 



Methods 


ROC 


ROC50 


Profile 


Sequence 


Source 


SVM-DR 


0.919 


0.715 


No 


Yes 


This 
study 


Monomer-dist 


0.919 


0.508 


No 


Yes 


[30] 


SVM-LA (8 = 0.5) 


0.925 


0.649 


No 


Yes 


[18] 


SVM-Pairwise 


0.901 


0.399 


No 


Yes 


[16] 


SVM-DT 


0.948 


0.800 


Yes 


No 


This 
study 


SVM-Top-n-gram 


0.907 


0.696 


Yes 


No 


[28] 


SVM-Top-n-gram-LSA 


0.939 


0.767 


Yes 


No 


[28] 


SVM-PDT-Profile (fi = 8, 

17 = 2) 


0.950 


0.740 


Yes 


No 


[29] 


PseAACIndex-Porfile 
(A = 5) 


0.922 


0.712 


Yes 


No 


[41] 


BioSVM-2L (1st+2nd 
layers) 


0.927 


0.888 


Yes 


No 


[42] 


HHSearch 


0.915 


0.990 


Yes 


No 


[4] 



The columns "Profile" and "Sequence" denote whether the method belongs to 
a class ("Yes") or not ("No"), where "Profile" indicates the method uses profile 
to extract the features, "Sequence" means that the method only uses the 
sequence composition of proteins to construct the feature vectors. 
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weight vectors of the training set with N samples obtained 
from the kernel-based training a= [a 1( a 2 , cc 3 ,..., a N ], the 
discriminant weight vector w in the feature space can be 
calculated by the following equation: 

w = M*a (3) 

Where M is the matrix of sequence representatives. 
The magnitude of the element in w represents the dis- 
criminative power of the corresponding feature. We 
used the L 2 -norm of the discriminant weight vector w 
of each Top- 1 -gram pairs and residue pairs to measure 
the importance of the features. 

Family 2.5.1.3 was selected as a target family for the 
feature analysis. The predictive results of SVM-DT on 
this family are obviously higher than those of the SVM- 
DR (0.993 VS 0.844 in terms of average ROC score). 
The L 2 -norm of 400 Top-l-gram pairs and residue pairs 
for these two methods are depicted in Figure 4. Accord- 
ing to the figure, interestingly, the top two most discri- 
minative pairs are (G, G), (L, L) for both of the two 
methods (the two darkest spots in each subfigure of 
Figure 4), which indicates the importance of amino acid 
G (Glycine) and L (Leucine). The strong discriminative 
power of Top-l-gram pair (G, G) on protein family 
2.5.1.3 (Multidomain cupredoxins) is not surprising, 
because highly conserved sequence PHGGGWGQ are 
abundant in cupredoxins [44], and Top-l-gram pair (G, G) 
can capture this sequence pattern, which would be the rea- 
son for better performance of SVM-DT on this protein 
family. 

Figure 5(A) and 5(B) show the discriminant weights of 
the top two most important Top-l-gram pairs (G, G) 
and (L, L) of the SVM-DT on family 2.5.1.3, and the 



discriminant weights of the top two most important 
residue pairs (G, G) and (L, L) of the SVM-DR are 
shown in Figure 2(C) and 2(D), respectively. As indi- 
cated by the above results, short distances are more 
important for discrimination than longer distances, 
which coincides with the ladder- shaped structure of dis- 
criminant values for distances. These results demon- 
strate that using the Top-l-gram pairs with distances 
shorter than a given distance threshold d MAX to con- 
struct the feature vector is an efficient strategy to reduce 
computational cost, because shorter distances are more 
important than longer distances for protein remote 
homology detection. Figure 5(A) shows the discriminant 
weight of pair (G, G) of SVM-DT and the magnitude of 
zero-distance shows the importance of Glycine fre- 
quency for discrimination. Most of the distances 
between Top-l-gram pairs (G, G) show positive discri- 
minative power, while only a few distances show nega- 
tive discriminative power, such as the distances 3, 25, 
26. Figure 5(B) shows the discriminant weight of pair 
(L, L) of SVM-DT, which shows similar patterns. Note 
that the Top-l-gram pairs with zero-distance always 
show higher discriminative power than other distance 
values for both of the two features, indicating the local 
structure, especial the subsequence of proteins are very 
important for protein remote homology detection. 
Figure 5(C) and 5(D) show the discriminant weights of 
the top two most important residue pairs of the SVM- 
DR on family 2.5.1.3 after SVM training process. These 
two subfigures also show similar ladder-shaped struc- 
ture, but there are more features show positive discrimi- 
native power than those in the SVM-DT as shown in 
Figure 5(A) and 5(B). 



( \ 



(A) L2-norm of Top- 1 -gram pairs (B) L2-norm of residue pairs 




ARDCQEHI GNL KMF PSTWYV ARDCQEHI GNL KMF PSTWYV 

first Top- 1 -gram first residue 



Figure 4 The discriminative power (L 2 -norm) of discriminant vectors for all possible combinations of Top-1-gram pair (A) and residue 
pair (B) of protein family 2.5.1.3. The amino acids are identified by their one-letter code. The amino acids labeled by x-axis and y-axis in 
figure(A) indicate the first Top-1-gram and the second Top-1-gram in Top-1-gram pairs of SVM-DT, respectively; the amino acids labeled by x-axis 
and y-axis in figure (B) indicate the first residue and the second residue in residue pairs of SVM-DR, respectively. The adjacent color bar shows 
the mapping of L 2 -norm values. 
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(A) Top-l-gram pair (G,G) of SVM-DT (B) Top-l-gram pair (L,L) of SVM-DT 
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(C) Residue pair (G,G) of SVM-DR 
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(D) Residue pair (L,L) of SVM-DR 
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Figure 5 The discriminant weights of the most discriminative Top-1-gram pairs (G, G) and (L, L) of SVM-DT for family 2.5.1.3 are 
shown in figure (A) and (B), respectively; the discriminant weights of the most discriminative residue pairs (G, G) and (L, L) of SVM- 
DR for family 2.5.1.3 are shown in figure (C) and (D), respectively 



Conclusion 

In this study, we proposed two methods SVM-DT and 
SVM-DR for protein remote homology detection, in 
which the feature vectors were constructed based on the 
occurrences of Top-n-gram pairs or residue pairs at dis- 
tances shorter than a distance threshold (Imax- These 
approaches can be viewed as position dependant meth- 
ods that incorporate the sequence-order information. 
SVM-DR is a sequence-based method, its advantage is 
that it doesn't need time consuming multiple sequence 
alignment step. SVM-DT is a profile-based method, 
which achieves more accurately predictive performance 
but higher computational cost is required due to the 



generation of Top-n-grams. Recently, position depen- 
dant methods have been attracted much attention. 
Remote homology proteins share low sequence similar- 
ity, and therefore, structure information is a key to 
improve the predictive performance. These position 
dependant methods partly incorporate the structure 
information by considering the relative orders of resi- 
dues or other building blocks of proteins occurring in 
protein sequences, such as Monomer-dist proposed by 
Linger et al [30]. This method used the distances 
between short oligomers to produce the feature vectors, 
which gave rise to very high-dimensional feature vectors. 
In contract, SVM-DR efficiently reduced the dimension 
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of feature vectors by only considering the residue pairs 
at distances shorter than a distance threshold d MAX - 
SVM-DT further improved SVM-DR by using Top-n- 
grams to replace the residues in proteins and produced 
feature vectors based on Top-n-gram distances. This 
profile-based method used the evolutionary information 
in profiles and therefore showed better performance 
than the sequence-based methods and the position inde- 
pendent methods, such as SVM-Top-n-gram [28], indi- 
cating that the distance-based approaches are relevant 
for discrimination. Recent studies showed that besides 
sequence and profile information, other features describ- 
ing the physicochemical properties of amino acids can 
accurately detect the protein homologies, such as the 
amino acid index (AAIndex) [29,41]. We are looking 
forward to incorporating these features into the pro- 
posed distance-based framework and exploring new 
mathematical and statistical models for the representa- 
tion of protein sequences. 

Competing interests 

The authors declare that they have no competing interests. 
Authors' contributions 

BL conceived of the study and carried out the protein remote homology 
detection study, participated in designing the study, coding the 
experiments, drafting the manuscript and performing the statistical analysis. 
JHX participated in coding the experiments and drafting the manuscript. 
RFX, QZ, XLW, QCC participated in performing the statistical analysis. All 
authors read and approved the final manuscript. 

Acknowledgements 

This work was supported by the National Natural Science Foundation of 
China (No. 61300112, No. 61370165, No. 61173075, No. 61272383, and No. 
61370010), the Natural Science Foundation of Guangdong Province (No. 
S20 12040007390 and No.S201 301 001 4475), the Scientific Research Innovation 
Foundation in Harbin Institute of Technology (Project No. HIT. 
NSRIF.2013103), the Shanghai Key Laboratory of Intelligent Information 
Processing, China (Grant No. IIPL-201 2-002). MOE Specialized Research Fund 
for the Doctoral Program of Higher Education 20122302120070, Shenzhen 
Foundational Research Funding JCYJ201 2061 31 52557576, Shenzhen 
International Cooperation Research Funding GJHZ201 2061 3 1 10641217, and 
Open Projects Program of National Laboratory of Pattern Recognition. 

Declarations 

The publication costs for this article were funded by the corresponding 
author. 

This article has been published as part of BMC Bioinformatics Volume 15 
Supplement 2, 2014: Selected articles from the Twelfth Asia Pacific 
Bioinformatics Conference (APBC 2014): Bioinformatics. The full contents of 
the supplement are available online at http://www.biomedcentral.com/ 
bmcbioinformatics/supplements/15/S2. 

Authors' details 

'School of Computer Science and Technology, Harbin Institute of 
Technology Shenzhen Graduate School, Shenzhen, Guangdong, China. 2 Key 
Laboratory of Network Oriented Intelligent Computation, Harbin Institute of 
Technology Shenzhen Graduate School, Shenzhen, Guangdong, China. 
Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, 
China. 4 School of Information Science and Technology, Xiamen University, 
Xiamen, Fujian, China. 

Published: 24 January 2014 



References 

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment 
Search Tool. J Mol Biol 1 990, 21 5(3)403-41 0. 

2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, 
Lipman DJ: Gapped BLAST and PSI-BLAST: A New Generation of Protein 
Database Search Programs. Nucleic Acids Res 1997, 25(17):3389-3402. 

3. Karplus K, Barrett C, Hughey R: Hidden Markov Models for Detecting 
Remote Protein Homologies. Bioinformatics 1998, 14(10):846-856. 

4. Sading J: Protein Homology Detection by HMM-HMM Comparison. 
Bioinformatics 2005, 21 (9):95 1-960. 

5. Sadreyev Rl, Tang M, Kim B-H, Grishin NV: COMPASS Server for Homology 
Detection: Improved Statistical Accuracy, Speed and Functionality. 
Nucleic Acids Res 2009, 37(Web Server):W90-W94. 

6. Jaroszewski L, Z ZL, Cai X-H, Weber C, Godzik A: FFAS Server: 
Novelfeatures and Applications. Nucleic Acids Res 201 1, 39(Web Server): 
W38-W44. 

7. Tomii K, Akiyama Y: FORTE: a Profile-Profile Comparison Tool for Protein 
Fold Recognition. Bioinformatics 2004, 20(4)594-595. 

8. Noble WS, Kuang R, Leslie C, Weston J: Identifying Remote Protein 
Homologs by Network Propagation. The FEBS journal 2005, 

272(20)5119-5128. 

9. Brandt BW, Heringa J: WebPRC: The Profile Comparer for Alignment- 
Based Searching of Public Domain Databases. Nucleic Acids Res 2009, 
37(Web Server):W48-W52. 

10. Kelley LA, Sternberg MJ: Protein Structure Prediction on The Web: A Case 
Study Using The Phyre Server. Nat Protoc 2009, 4(3)363-371. 

1 1 . Lobley A, Sadowski MJ, Jones DT pGenTH READER and pDomTHREADER: 
New Methods for Improved Protein Fold Recognition and Superfamily 
Fiscrimination. Bioinformatics 2009, 25(14):1 761 -1767. 

12. Margelevicius M, Venclovas MLC: COMA Server for Protein Distant 
Homology Search. Bioinformatics 2010, 26(15):! 905-1 906. 

13. Gront D, Blaszczyk M, Wojciechowski P, Kolinski A: BioShell Threader: 
Protein Homology Detection Based on Sequence Profiles and Secondary 
Structure Profiles. Nucleic Acids Res 2012, 40(Web Server):W257-W262. 

14. Noble WS, Pavlidis P: Support Vector Machine and Kernel Principal 
Components Analysis Software Toolkit. Columbia University 2002. 

1 5. Jaakkola T, Diekhans M, Haussler D: Using the Fisher Kernel Method to 
Detect Remote Protein Homologies. Proceedings of the Seventh International 
Conference on Intelligent Systems for Molecular Biology 1 999, 1 49-1 58. 

16. Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support 
Vector Machines for Detecting Remote Protein Evolutionary and 
Structural Relationships. J Comput Biol 2003, 10(6):857-868. 

17. Rost B: Twilight zone of protein sequence alignments. Protein Eng 1999, 
12(2):85-94. 

18. Saigo H, Vert JP, Ueda N, Akutsu T: Protein Homology Detection Using 
String Alignment Kernels. Bioinformatics 2004, 20(1 1):1 682-1 689. 

19. Shah AR, Oehmen CS, Webb-Robertson B-J: SVM-HUSTLE-an Iterative 
Semi-Supervised Machine Learning Approach for Pairwise Protein 
Remote Homology Detection. Bioinformatics 2008, 24(6)783-790. 

20. Ben-Hur A, Brutlag D: Remote Homology Detection: A Motif Based 
Approach. Bioinformatics 2003, 19(Suppl 1):i26-i33. 

21. Leslie C, Eskin E, Noble WS: The Spectrum Kernel: A String Kernel for svm 
Protein Classification. Proc Pacific Symposium on Biocomputing 2002, 
566-575. 

22. Hou Y, Hsu W, Lee ML, Bystroff C: Efficient Remote Homology Detection 
Using Local Structure. Bioinformatics 2003, 19(17)2294-2301. 

23. Leslie CS, Eskin E, Cohen A, Weston J, Noble WS: Mismatch String Kernels 
for Discriminative Protein Classification. Bioinformatics 2004, 20(4)467-476. 

24. Dong QW, Wang XL, Lin L: Application of Latent Semantic Analysis to 
Protein Remote Homology Detection. Bioinformatics 2006, 22(3)285-290. 

25. Ogul H, Mumcuoglu EU: A discriminative method for remote homology 
detection based on n-peptide compositions with reduced amino acid 
alphabets. BioSystems 2007, 87(1):75-81. 

26. Rangwala H, Karypis G: Profile-Based Direct Kernels for Remote Homology 
Detection and Fold Detection. Bioinformatics 2005, 21(23)4239-4247. 

27. Kuang R, le E, Wang K, Wang K, Siddiqi M: Profile-Based Direct Kernels for 
Remote Homology Detection and Motif Extraction. J Bioinform Comput 
Biol 2005, 3(3)527-550. 

28. Liu B, Wang X, Lin L, Dong Q, Wang X: A Discriminative Method for 
Protein Remote Homology Detection and Fold Recognition Combining 



Liu et al. BMC Bioinformatics 2014, 15(Suppl 2):S3 
http://www.biomedcentral.eom/1 471 -2 1 05/1 5/S2/S3 



Page 10 of 10 



Top-n-grams and Latent Semantic Analysis. BMC Bioinformatics 2008, 
9:510. 

29. Liu B, Wang X, Chen Q, Dong Q, Lan X: Using Amino Acid 
Physicochemical Distance Transformation for Fast Protein Remote 
Homology Detection. PLoS ONE 2012, 7(9):e46633. 

30. LingnerT, Meinicke P: Remote Homology Detection Based on Oligomer 
Distances. Bioinformatics 2006, 22(18)2224-2231. 

31. Liu X, Zhao L, Dong Q: Protein Remote Homology Detection Based on 
Auto-Cross Covariance Transformation. Computers in Biology and Medicine 
2011,41(8)640-647. 

32. Hou Y, Hsu W, Lee L, Bystroff C: Remote Homolog Detection Using Local 
Sequence-Structure Correlations. Proteins 2004, 57(3)518-530. 

33. Yang Y, Tantoso E, Li K B: Remote Protein Homology Detection Using 
Recurrence Quantification Analysis and Amino Acid Physicochemical 
Properties. Journal of Theoretical Biology 2008, 252(1 ):1 45-1 54. 

34. Zou Q, Wang Z, Wu Y, Liu B, Lin Z, Guan X: An Approach for Identifying 
Cytokines Based On a Novel Ensemble Classifier. BioMed Research 
International 2013, 686090. 

35. Zhang Y, Liu B, Dong Q, Jin VX: An improved profile-level domain linker 
propensity index for protein domain boundary prediction. Protein and 
Peptide Letters 2011, 1 8(1 ):7-l 6. 

36. Liu B, Wang X, Lin L, Tang B, Dong Q, Wang X: Prediction of protein 
binding sites in protein structures using hidden Markov support vector 
machine. BMC Bioinformatics 2009, 10:381. 

37. Liu B, Wang X, Lin L, Dong Q, Wang X: Exploiting three kinds of interface 
propensities to identify protein binding sites. Computational Biology and 
Chemistry 2009, 33(4)303-311. 

38. Liu B, Zhang D, Xu R, Xu J, Wang X, Chen Q, Dong Q, Chou K-C: 
Combining evolutionary information extracted from frequency profiles 
with sequence-based kernels for protein remote homology detection. 
Bioinformatics , DOI: btt709. 

39. Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: 
SCOP Database in 2004: Refinements Integrate Structure and Sequence 
Family Data. Nucleic Acids Research 2004, 32(Database):D226-D229. 

40. Brenner SE, Koehl P, M ML: The ASTRAL Compendium for Sequence and 
Structure Analysis. Nucleic Acids Res 2000, 28(1)254-256. 

41. Liu B, Wang X, Zou Q, Dong Q, Chen Q: Protein Remote Homology 
Detection by Combining Chou's Pseudo Amino Acid Composition and 
Profile-Based Protein Representation. Molecular Informatics 2013, 
32:775-782. 

42. Muda HM, Saad P, Othman RM: Remote Protein Homology Detection and 
Fold Recognition Using Two-Layer Support Vector Machine Classifiers. 

Computers in Biology and Medicine 201 1 , 41 (8):687-699. 

43. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, 
Kanehisa M: AAindex: Amino Acid Index Database, Progress Report 2008. 
Nucleic Acids Res 2008, 36(Database):D202-D205. 

44. Burns CS, Aronoff-Spencer E, Dunham CM, Lario P, Avdievich Nl, 
Antholine WE, Olmstead (VIM, Vrielink A, Gerfen GJ, Peisach J, et ah 
Molecular Features of the Copper Binding Sites in the Octarepeat 
Domain of the Prion Protein. Biochemistry 2002, 41(12)3991-4001. 



doi:1 0.1 1 86/1 471-21 05-1 5-S2-S3 

Cite this article as: Liu et al: Using distances between Top-n-gram and 
residue pairs for protein remote homology detection. BMC Bioinformatics 
2014 15(Suppl 2):S3. 



Submit your next manuscript to BioMed Central 
and take full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at f \ RioM _-| rpntr ,i 

www.biomedcentral.com/submit \ J Blomea ^e™ 31 



